Posters - Schedules

Wednesday, November 9, 2022 between 8:30 AM - 9:30 AM
Thursday, November 10, 2022 between 8:30 AM - 9:30 AM
Friday, November 11, 2022 between 8:30 AM - 9:30 AM

01: Sequence-based correction of barcode bias in massively parallel reporter assays
COSI: rsg
  • Dongwon Lee, Boston Children's Hospital & Harvard Medical School, United States
  • Ashish Kapoor, University of Texas Health Science Center at Houston, United States
  • Changhee Lee, Harvard Medical School, United States
  • Michael Mudgett, Johns Hopkins University, United States
  • Michael Beer, Johns Hopkins University, United States
  • Aravinda Chakravarti, NYU Grossman School of Medicine, United States


Presentation Overview:

Massively parallel reporter assays (MPRAs) are a high-throughput method for evaluating in vitro activities of thousands of candidate cis-regulatory elements (CREs). In these assays, candidate sequences are cloned upstream or downstream from a reporter gene tagged by unique DNA sequences. However, tag sequences may themselves affect reporter gene expression and lead to major potential biases in the measured cis-regulatory activity. Here, we present a sequence-based method for correcting tag-sequence-specific effects and show that our method can significantly reduce this source of variation and improve the identification of functional regulatory variants by MPRAs. We also show that our model captures sequence features associated with post-transcriptional regulation of mRNA. Thus, this new method helps not only to improve detection of regulatory signals in MPRA experiments but also to design better MPRA protocols.

02: Hi-C chromatin interaction networks reveal classes of chained enhancers
COSI: rsg
  • Dylan Barth, UNLV, United States
  • Mira Han, UNLV, United States


Presentation Overview:

Since the invention of Hi-C technology, the functional role of physical contact between enhancer and promoter elements has been of great scientific interest. However, contact as measured by Hi-C data alone is not sufficient to explain which enhancer elements are functional; in fact, some enhancers that significantly regulate a gene’s expression have no significant contact with its promoter. Using a combination of Hi-C maps, ChIP-Seq data from the ENCODE database, and a dataset of CRISPRi-perturbed enhancers and their relative effects on gene expression, we developed a machine learning pipeline to identify important features of enhancer elements with and without direct contact with their respective genes. We found that functional enhancers that are not in direct contact with promoters have distinct TF binding patterns compared to those that contact the promoter directly.

03: Identification of circular RNAs and sequence-based rules underlying back-splicing
COSI: rsg
  • Lisa Shrestha, The University of Alabama at Birmingham, United States
  • Andre Leier, The University of Alabama at Birmingham, United States


Presentation Overview:

Circular RNAs (circRNAs) are a class of long non-coding RNAs. Although they have been known for decades, only in recent years have scientists started to recognize their role in health and disease. CircRNAs are characterized by a covalently closed loop structure that is formed by a back-splicing event. Consequently, they lack 5’-3’ polarity and a polyadenylated tail. Their circular structure makes circRNAs resistant to exonuclease degradation and hence more stable than their cognate linear RNAs, a property that augments their potential as effective disease biomarkers and therapeutic molecules. In recent years, thousands of circRNAs have been identified, and many more are likely waiting to be discovered.

Recent studies have used machine learning (ML) methods to understand the mechanisms driving back-splicing, which could help identify novel circRNAs. A few tools that employ deep learning (DL) have achieved accuracies, sensitivities, and specificities of over 95% but lack interpretable results. We tested different physicochemical properties of RNA sequence features using conventional, interpretable ML algorithms and achieved performance comparable to that of DL-based tools. Feature analysis of our best-performing ML approach identified the relative importance of certain nucleotides in determining back-splicing and circular RNA formation.

04: MethyEnrich - gene set and transcription factor enrichment designed for DNA methylation high-throughput sequencing data
COSI: rsg
  • Shiting Li, University of Michigan, United States
  • Elysia Chou, University of Michigan, United States
  • Tingting Qin, University of Michigan, United States
  • Maureen Sartor, University of Michigan, United States


Presentation Overview:

Introduction:
DNA methylation is an essential epigenetic mark that regulates gene expression by inhibiting or promoting transcription factor (TF) binding at regulatory elements across the genome. Whole genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS) are CpG site resolution sequencing methods to evaluate DNA methylation profiles. Although many tools are designed for these two methods to study differential methylation and site annotation, tools for performing downstream gene set enrichment (GSE) testing for differential methylation are still lacking. Although we and others have developed methods for GSE testing of large sets of genomic regions, none consider the location of CpG sites or their effect on TF binding.

Methods:
We collected WGBS, ChIP-seq, and ATAC-seq data from cell lines to study transcription factor binding affinity at methylated versus unmethylated DNA. For each cell line, ChIP-seq and ATAC-seq were used to identify bound transcription factor binding sites (TFBS) and unbound TF motif locations, and we studied the methylation patterns of bound vs unbound sites to build a genome-wide reference for enrichment testing. The core test is a generalized linear model with gene-level weights calculated by combining the differential methylation in gene regulatory regions, including enhancers, with the TF effects in our reference. We also output the methylation effect of specific TFs on each gene set term. For evaluation, paired transcriptomic and methylation sequencing data are used to check the captured gene set terms, and we compare our method with alternative approaches using ROC and AUC values.

Results:
We have finished most of the reference data preprocessing and the coding of the MethyEnrich R package, and have performed a series of evaluations on differential methylation in cancer subtypes and human tissues. Using only the subset of differentially methylated sites in TFBSs of 447 TFs from ChIP-seq and TFBSs of 971 TFs from ATAC-seq, MethyEnrich produces results comparable to those of tools that use all differentially methylated regions, such as GREAT or Poly-Enrich. These tests also indicate that comprehensive inclusion of TFBSs across the genome is crucial for method performance. Future work will focus on identifying more TFBSs, combining the ChIP-seq and ATAC-seq references, and implementing a statistical method to extract each gene set’s TF perturbations from methylation changes.

05: scMultiSim: simulation of multi-modality single cell data guided by cell-cell interactions and gene regulatory networks
COSI: rsg
  • Hechen Li, Georgia Institute of Technology, United States
  • Ziqi Zhang, Georgia Institute of Technology, United States
  • Michael Squires, Georgia Institute of Technology, United States
  • Xi Chen, Southern University of Science and Technology, China
  • Xiuwei Zhang, Georgia Institute of Technology, United States


Presentation Overview:

Simulated single-cell data is widely used to aid in designing and benchmarking computational methods because experimental ground truth is scarce. Recently, an increasing number of computational methods have been developed to address various problems with single-cell data, including cell clustering, trajectory inference, integration of data from multiple batches and modalities, and inference of gene regulatory networks (GRNs) and cell-cell interactions. Simulators that are designed to test a particular computational problem model only the factors that affect the output data for that problem, whereas modelling as many biological factors as possible in the simulation gives the generated data realistic complexity and allows it to be used to benchmark a wider range of computational methods. Here, we present scMultiSim, an in silico simulator that generates multi-modality data of single cells, including gene expression, chromatin accessibility, RNA velocity, and spatial cell locations, while accounting for the relationships between modalities. We propose a unified framework to jointly model biological factors including cell-cell interactions, within-cell GRNs and chromatin accessibility, so that all their effects are simultaneously present in the output data. Users have full control of the cell population and can fine-tune each factor’s effect on the underlying model. We also provide options to simulate technical variation, including batch effects, to make the output resemble real data. We verified the simulated biological effects and demonstrated scMultiSim’s applications by benchmarking four computational tasks on single-cell multi-omics data: GRN inference, RNA velocity estimation, integration of single-cell datasets from multiple batches and modalities, and analysis of cell-cell interactions using the cell spatial location data.
To our knowledge, scMultiSim is the only simulator of single cell data that can perform benchmarking for all these four challenging computational tasks.

06: The role of Myristoylated, Alanine-rich C-kinase Substrate (MARCKS) in macrophages upon stimulation of Toll-like receptor 4.
COSI: rsg
  • Jiraphorn Issara-Amphorn, Functional Cellular Networks Section, LISB, NIAID, National Institutes of Health, United States
  • Virginie Sjoelund, Functional Cellular Networks Section, LISB, NIAID, National Institutes of Health, United States
  • Aleksandra Nita-Lazar, Functional Cellular Networks Section, LISB, NIAID, National Institutes of Health, United States


Presentation Overview:

MARCKS (Myristoylated Alanine-rich C-kinase Substrate) is a membrane protein expressed in many cell types, including macrophages, and is functionally related to cell adhesion, phagocytosis and inflammatory responses. LPS (lipopolysaccharide), one of the strongest PAMPs (pathogen-associated molecular patterns), triggers inflammation via TLR4 (Toll-like receptor 4). During TLR4 stimulation, MARCKS is phosphorylated by PKC (protein kinase C), resulting in its release to the cytosol followed by activation of the inflammatory signal transduction pathways. The phosphorylation site of MARCKS (phospho-MARCKS) on serine (S163) may have a regulatory role, since we found changes in its phosphorylation during TLR stimulation. Serine phosphorylation serves as a key regulator of many physiological processes, including innate and adaptive immune responses. Although MARCKS and phospho-MARCKS in macrophages have already been described, their role in macrophage function remains unclear. As a proof-of-concept study, we activated macrophages with LPS with or without a PKC inhibitor. We found that the PKC inhibitor predominantly decreased IL6 and TNF production. In addition, MARCKS and phospho-MARCKS increased their co-localization with the endosome in response to LPS stimulation, as determined by confocal microscopy. Moreover, CRISPR-Cas9-mediated knockout of MARCKS in macrophages downregulated TNF and IL6 during LPS stimulation, suggesting a potential impact of MARCKS on inflammatory responses. Our comprehensive proteomic analysis comparing LPS stimulation of WT and CRISPR-Cas9-mediated MARCKS knockout macrophages revealed global proteome changes in specific biological processes, including inflammatory responses and cytokine-mediated signaling pathways, and points to specific proteins that may regulate these changes during LPS stimulation.
The discovery of the mechanism by which MARCKS contributes to the inflammatory response may provide new strategies to manipulate inflammation-related diseases.

“This research was supported by the Intramural Research Program of NIAID, NIH.”

07: Single-cell and bulk transcriptome sequencing identifies two epithelial tumor cell states and refines the consensus molecular classification of colorectal cancer
COSI: rsg
  • Ignasius Joanito, A*STAR, Singapore
  • Pratyaksha Wirapati, Swiss Institute of Bioinformatics, Switzerland
  • Nancy Zhao, A*STAR, Singapore
  • Zahid Nawaz, A*STAR, Singapore
  • Grace Yeo, A*STAR, Singapore
  • Fiona Lee, A*STAR, Singapore
  • Christine Eng, A*STAR, Singapore
  • Dominique Camat Macalinao, NCCS, Singapore
  • Merve Kahraman, A*STAR, Singapore
  • Harini Srinivasan, A*STAR, Singapore
  • Vairavan Lakshmanan, A*STAR, Singapore
  • Sara Verbandt, Katholieke Universiteit Leuven, Belgium
  • Petros Tsantoulis, University of Geneva, Switzerland
  • Nicole Gunn, NCCS, Singapore
  • Prasanna Nori Venkatesh, A*STAR, Singapore
  • Zhong Wee Poh, A*STAR, Singapore
  • Rahul Nahar, MSD International GmbH, Singapore
  • Hsueh Ling Janice Oh, MSD International GmbH, Singapore
  • Jia Min Loo, A*STAR, Singapore
  • Shumei Chia, A*STAR, Singapore
  • Lih Feng Cheow, NUS, Singapore
  • Elsie Cheruba, NUS, Singapore
  • Michael Thomas Wong, MSD International GmbH, Singapore
  • Lindsay Kua, A*STAR, Singapore
  • Clarinda Chua, NCCS, Singapore
  • Andy Nguyen, NantOmics, USA
  • Justin Golovan, NantOmics, USA
  • Anna Gan, A*STAR, Singapore
  • Wan-Jun Lim, NCCS, Singapore
  • Yu Amanda Guo, A*STAR, Singapore
  • Choon Kong Yap, A*STAR, Singapore
  • Brenda Tay, NCCS, Singapore
  • Yourae Hong, Samsung Genome Institute, Korea
  • Dawn Qingqing Chong, NCCS, Singapore
  • Aik-Yong Chok, Singapore General Hospital, Singapore
  • Woong-Yang Park, Samsung Genome Institute, Singapore
  • Shuting Han, NCCS, Singapore
  • Mei Huan Chang, Singapore General Hospital, Singapore
  • Isaac Seow En, Singapore General Hospital, Singapore
  • Cherylin Fu, Singapore General Hospital, Singapore
  • Ronnie Mathew, Singapore General Hospital, Singapore
  • Ee-Lin Toh, Singapore General Hospital, Singapore
  • Lewis Z. Hong, MSD International GmbH, Singapore
  • Anders Jacobsen Skanderup, A*STAR, Singapore
  • Ramanuj DasGupta, A*STAR, Singapore
  • Chin-Ann Johnny Ong, NCCS, Singapore
  • Kiat Hon Lim, Singapore General Hospital, Singapore
  • Emile K. W. Tan, Singapore General Hospital, Singapore
  • Si-Lin Koo, NCCS, Singapore
  • Wei Qiang Leow, Singapore General Hospital, Singapore
  • Sabine Tejpar, Katholieke Universiteit Leuven, Belgium
  • Shyam Prabhakars, A*STAR, Singapore
  • Iain Beehuat Tan, NCCS, Singapore


Presentation Overview:

Colorectal cancer (CRC) is widely classified across clinical and biological studies into 4 consensus molecular subtypes (CMS), based on bulk gene expression profiles. However, the underlying epithelial cell diversity remains unclear. Here, we sought to study the epithelial subtypes that underpin the molecular classification of colorectal cancer. Single-cell transcriptomes of 189 samples from 63 patients across 5 cohorts were integrated to construct one of the largest single-cell CRC datasets to date. Of the 373,058 single cells profiled, we focused primarily on the 49,155 epithelial cells. Remarkably, amongst malignant epithelial cells, 2 distinct subtypes consistently emerged after independent analyses of single-cell expression, regulon and inferred copy number profiles, suggesting a common genetic program dictating 2 major epithelial subtypes in CRC. We quantified our intrinsic epithelial signature in 3,614 bulk transcriptomes across 15 datasets and recapitulated these 2 intrinsic subtypes. We observed a correspondence to the CMS classification and termed the 2 epithelial groups intrinsic-consensus molecular subtypes (iCMS), consisting of iCMS2 (i2) and iCMS3 (i3). Amongst MSS cancers, most CMS2 and CMS3 tumors had i2 and i3 epithelium, respectively. MSI-H and CMS1 cancers were generally classified as iCMS3. Importantly, we find that MSS tumors with i3 epithelium (i3_MSS) had transcriptomic, genomic and biological pathway enrichment features that were more similar to MSI-H cancers than to i2_MSS cancers. The fibrotic CMS4 group comprised cancers with either i2 or i3 epithelial cells, suggesting that fibrosis is orthogonal to the intrinsic CRC epithelial structure. Importantly, whilst CMS4 as a whole is associated with poor relapse-free survival, we identified a subclass of fibrotic CMS4 cancers with i3 epithelial cells that had the worst prognosis.
With these insights, we propose a refinement of the classification of colorectal cancer, the IMF classification, with 5 subtypes combining intrinsic epithelial subtype (I), microsatellite instability status (M) and fibrosis (F).

08: Binding-site resolution chromatin accessibility dynamics of Cebpa enhancers during macrophage-neutrophil differentiation
COSI: rsg
  • Trevor Long, University of North Dakota, United States
  • Tapas Bhattacharyya, Michigan State University, United States
  • Madison Naylor, University of North Dakota, United States
  • Andrea Repele, University of North Dakota, United States
  • Shawn Krueger, University of North Dakota, United States
  • Manu Manu, University of North Dakota, United States


Presentation Overview:

During hematopoiesis, multipotential progenitors differentiate into all the types of cells found in blood by the action of complex gene regulatory networks (GRNs) comprising transcription factor (TF) genes that regulate each other’s expression. The gene regulatory logic—by which we mean the identities of the TF regulators, where they bind, whether they activate or repress, and how they interact with each other—remains unknown for most hematopoietic genes. Here, we utilize high coverage ATAC-Seq and reporters integrated in a site-specific manner to investigate binding-site resolution chromatin accessibility dynamics of Cebpa enhancers during macrophage-neutrophil differentiation. Reporter genes were integrated into the ROSA26 locus using CRISPR/Cas9 in PUER cells, which can be differentiated into neutrophils or macrophages in vitro. Time series reporter data showed an upregulation of two enhancers during the first 48 hours of neutrophil differentiation, which matched the temporal expression pattern of the endogenous Cebpa gene. Chromatin accessibility was also profiled at several time points by high coverage (300 million reads per sample) ATAC-Seq. The accessibility patterns were highly correlated between time points, allowing us to pool the data and achieve ~100x coverage. These data revealed that the enhancers had many protected regions interspersed with short high-accessibility islands. All previously characterized TF binding sites and centers of ChIP-Seq peaks overlapped protected regions, indicating that they are TF footprints. Surprisingly, while the total accessibility of the enhancers increased during differentiation, this increase occurred after gene expression had peaked, implying that the total accessibility of the enhancers was not driving gene expression. 
However, Tn5 counts at the high-accessibility islands adjacent to TF footprints, especially for C/EBPα sites, increased at earlier time points, suggesting that TF occupancy drives the upregulation in gene expression as well as the later increase in the total accessibility of the enhancers. Our high-coverage accessibility maps reveal a complex regulatory architecture, with 5-6 TFs binding up to 20 sites in a single Cebpa enhancer. Furthermore, the temporal dynamics of accessibility and gene expression do not support a causal role for total accessibility in gene expression. More generally, these results show that profiling gene expression and accessibility at high temporal and genomic resolution has the potential to reveal the causal drivers of gene expression changes during development.

09: A Robust Bayesian Approach to Bulk Gene Expression Deconvolution with Noisy Reference Signatures
COSI: rsg
  • Saba Ghaffari, University of Illinois at Urbana-Champaign, United States
  • Ehsan Saleh, University of Illinois at Urbana-Champaign, United States
  • Steven M. Offer, Mayo Clinic, United States
  • Saurabh Sinha, Georgia Institute of Technology, United States


Presentation Overview:

Differential gene expression in bulk transcriptomics data can reflect regulated change of transcript abundance within a cell type and/or change in the proportion of cell types within the sample. To differentiate these scenarios, bulk expression deconvolution methods have been developed, which reveal cell type-resolution transcriptomes at the larger scales afforded by bulk RNA-seq compared to single-cell RNA-seq, allowing greater statistical power in detecting transcriptomic and compositional changes in a biological process. These methods model a bulk RNA-seq profile as a weighted mixture of cell type-specific profiles, known as “signatures,” and estimate the weights and/or cell type signatures that comprise the bulk profile. A common approach is to rely on reference signatures from similar biological conditions to estimate cell type proportions. However, reference signatures, which are commonly obtained from scRNA-seq data, are often unsuitable for deconvolving bulk RNA-seq data due to a difference in technologies or biological conditions.

We present BEDwARS, a Bayesian deconvolution method specifically designed to address potential differences between reference signatures and the true but unknown signatures underlying the bulk transcriptomic profiles. BEDwARS defines a Bayesian model of the bulk profile, using reference signatures to parameterize the prior distribution on true signatures, and performs maximum a posteriori estimation of these signatures as well as cell type proportions, using Metropolis-Hastings sampling. Through extensive benchmarking on eight different datasets derived from pancreas and brain, and by generating additional reference signatures with varying degrees of added noise, we demonstrate that our method not only outperforms leading methods such as FARDEEP, CIBERSORT and CIBERSORTx in the estimation of cell type proportions but is also more robust to noise in the reference signatures. Furthermore, our method typically achieves a better estimation of true cell type signatures than RODEO, the state-of-the-art method for cell type signature estimation.

We applied BEDwARS to study a rare pediatric inborn error of metabolism linked to dihydropyrimidine dehydrogenase (DPD) deficiency. Using a limited set of scRNA-seq data, we deconvolved bulk RNA-seq data from complex patient-derived neural organoids to identify novel expression changes in specific neural cell types. Our findings suggest that multiple changes likely contribute to the pathology of the disorder, including disruptions to ciliated cell function, mitochondrial dysfunction, and alterations to the translational machinery, both at the ER and on free ribosomes. To our knowledge, this is the first reported evidence for possible involvement of ciliopathy and impaired translational control in the etiology of the disorder.

10: A machine learning framework for transcription factor network mapping and multi-omics integration
COSI: rsg
  • Dhoha Abid, Washington University in Saint Louis, United States
  • Michael R. Brent, Washington University in Saint Louis, United States


Presentation Overview:

TF network maps are defined by direct, functional (TF, target) edges. Direct means that the TF must bind to the cis-regulatory DNA of the target gene. Direct edges have been approximated by binding location experiments such as ChIP-Seq, but this is imperfect because a TF can be bound without affecting the expression of the target gene. An edge is functional if changes in the activity of the TF result in changes in the expression of the target gene. In the past two decades, TF network maps have been mainly inferred by regressing the expression level of the target gene on those of all TFs. This approach is based on the idea that if the expression level of a TF is correlated with that of a gene, it may be because the TF regulates the gene. Yet, the expression levels of a gene and a TF can be correlated because the TF regulates the gene indirectly, through one or more intermediate TFs, which does not meet the requirement that a TF network map have direct edges. This limitation can be addressed by combining gene-expression and binding data since these two types of data have complementary strengths and weaknesses. In this study, we combine these complementary data types with a novel, unconventional use of ML. Features are derived from gene-expression data, the ‘functional’ signal of an edge, and labels are derived from binding data, the ‘direct’ signal of an edge. Training instances are (TF, target) edges with features derived from regression and differential expression analyses of gene expression data. The training labels are one if there is evidence of the TF binding to the promoter region of the gene and zero otherwise. Here, the goal is not to generalize to unseen data, but to forge a consensus network from the features and the labels. The predictions of the trained model on the training instances are used as probability scores for potential (TF, target) edges. 
We found that these predictions are better than inferring a TF network map from gene-expression alone or binding data alone. Furthermore, training a separate model for each TF resulted in the most accurate TF network map. Overfitting a TF-specific model increased the influence of the binding data on the prediction scores, resulting in a more accurate network. We will describe how this unique application of ML can be applied to other problems in multi-omics integration.

11: FUN-PROSE: A Deep Learning Approach to Predict Condition-Specific Gene Expression
COSI: rsg
  • Ananthan Nambiar, University of Illinois Urbana-Champaign, United States
  • Veronika Dubinkina, University of Illinois Urbana-Champaign, United States
  • Simon Liu, University of Illinois Urbana-Champaign, United States
  • Sergei Maslov, University of Illinois Urbana-Champaign, United States


Presentation Overview:

The mRNA levels of all genes in a genome are a critical piece of information defining the overall state of the cell in a given environmental condition. Being able to reconstruct such condition-specific expression in fungal genomes is particularly important for metabolic engineering of these organisms to produce desired chemicals under industrially scalable conditions. Most previous deep learning approaches focused on predicting the average expression level of a gene from its promoter sequence, ignoring its variation across conditions. Here we present FUN-PROSE, a deep learning model trained to predict differential expression of individual genes across various conditions using their promoter sequences and the expression levels of all transcription factors. We train and test our model on three fungal species (Saccharomyces cerevisiae, Neurospora crassa and Issatchenkia orientalis) and obtain correlations between predicted and observed condition-specific gene expression as high as 0.85. We then interpret our model to extract promoter sequence motifs responsible for variable expression of individual genes. We also carried out input feature importance analysis to connect individual transcription factors to their gene targets. A sizeable fraction of both the sequence motifs and the TF-gene interactions learned by our model agree with previously known biological information, while the rest correspond to either novel biological facts or indirect correlations.

12: AMPK, friend or foe in cancer?
COSI: dream
  • Mehrshad Sadria, University of Waterloo, Canada
  • Deokhwa Seo, University of Waterloo, Canada
  • Anita Layton, University of Waterloo, Canada


Presentation Overview:

Nutrient acquisition and metabolism pathways are altered in cancer cells to meet bioenergetic and biosynthetic demands. A major regulator of cellular metabolism and energy homeostasis, in normal and cancer cells, is AMP-activated protein kinase (AMPK). AMPK influences cell growth via its modulation of the mechanistic target of rapamycin (mTOR) pathway: specifically, by inhibiting mTOR complex 1 (mTORC1), which facilitates cell proliferation, and by activating mTORC2, which promotes cell survival. Given these conflicting roles, the effects of AMPK activation in cancer can be counterintuitive. Prior to the establishment of cancer, AMPK acts as a tumour suppressor. However, following the onset of cancer, AMPK has been shown to either suppress or promote cancer, depending on the cell type or state. To unravel the controversial roles of AMPK in cancer, we developed a computational model to simulate the effects of pharmacological maneuvers that target key metabolic signalling nodes, with a specific focus on AMPK, mTORCs, and their modulators.

13: Most High Throughput Expression Data Sets Are Underpowered
COSI: rsg
  • Alexander Trostle, Baylor College of Medicine, United States
  • Jiasheng Wang, Baylor College of Medicine, United States
  • Lucian Li, Baylor College of Medicine, United States
  • Ying-Wooi Wan, Baylor College of Medicine, United States
  • Zhandong Liu, Baylor College of Medicine, United States


Presentation Overview:

Researchers handicap their high-throughput sequencing experiments when they do not perform enough biological replicates. Given the natural heterogeneity of biological tissues, studies with too few replicates will fail to detect true signatures, rendering the results irreproducible. Based on our statistical analysis of 450 data sets, we propose that at least six biological replicates per condition should be the standard for high-throughput differential expression experiments.

14: Advancing Modeling and Biomarker Identification in Biomedicine using Quantum Machine Intelligence
COSI: rsg
  • Nam Nguyen, Moffitt Cancer Center, United States
  • Kwang-Cheng Chen, University of South Florida, United States
  • Aleksandra Karolak, Moffitt Cancer Center, United States


Presentation Overview:

The successes and promise of quantum computing and simulation allow us to address emergent problems in biomedical science from the novel perspectives of quantum mechanics and machine intelligence. In this work, we introduce two neural architecture designs for Quantum Neural Networks (QNNs) that overcome the limitations of neural machine intelligence in the Noisy Intermediate-Scale Quantum era and tackle tumor burden modeling and biomarker identification in oncology. First, we introduce the EtaNet model, which adapts the successful parameter-sharing and attention-based mechanisms of classical Artificial Neural Networks. EtaNet enables Bayesian regressors for modeling tumor volume with respect to resistance to drug effects. Second, we propose the TripletEta-Fields model, which includes three periodic quantum kernels with physics-based information aggregation to perform general regression tasks in machine learning. By describing the proposed machine intelligence from the perspective of probability theory and probabilistic information blocks, we provide a rigorous interpretation of the model and a proof of concept by modeling COVID-19 population growth and tumor growth stratified by treatment. Both proposed model architectures outperform classical mathematical models in terms of accuracy and explainability of the model predictions. Finally, we extend the application of the introduced QNNs to biomarker identification for targeted pathways in cancer research. The expressiveness of quantum distributions at the output of QNNs allows us to identify genetic drivers for targeted genes with quantum advantage and enables an explainable hybrid Deep Learning Diagnosis System (DLDS). Our case study on a pathway of interest in pancreatic cancer reveals several novel drivers for such pathways beyond clinician-known regimes.
Moreover, the proposed DLDS offers explainable deep survival analysis and risk estimations per patient of interest with interactive visualization and analysis, enhancing the human-machine interface compared to conventional machine learning approaches.

15: Single-cell nanopore long-read RNA sequencing of colorectal cancer cells reveals highly heterogeneous microsatellite loci for retrospective lineage tracing
COSI: rsg
  • Hyun Su An, Gwangju Institute of Science and Technology, South Korea
  • Kyu Min Park, Gwangju Institute of Science and Technology, South Korea
  • Jihwan Park, Gwangju Institute of Science and Technology, South Korea


Presentation Overview: Show

Colorectal cancer (CRC) is one of the most prevalent cancers in humans. About 10-20% of stage I to III CRCs can be classified as microsatellite instability-high (MSI-H). Cancer cells with microsatellite instability (MSI) have defective DNA mismatch repair (MMR), leading to the accumulation of mutations at a faster rate (genetic hypermutability). Due to a high tumor mutational burden, CRCs with the MSI-H phenotype respond better to immune checkpoint inhibitors (ICIs), such as Pembrolizumab. However, the prognosis of advanced CRCs with MSI-H is still poor due to the low response rate (30-50%) and intrinsic resistance mechanisms to immunotherapy. Therefore, there is an urgent need to understand the biological mechanisms behind immunotherapy resistance and the metastatic phenotype of advanced CRCs with MSI-H. We reasoned that the high mutation rate of MSI-H CRCs might enable high-resolution retrospective lineage tracing of cancer cells through single-cell RNA sequencing, thus providing new insights into the subclonal structures of MSI-H CRCs and their response to immunotherapy. HCT116 is a colorectal carcinoma cell line from a human male. HCT116 is highly metastatic and commonly used in mouse models of metastatic colorectal cancer. HCT116 is also classified as MSI-H. The defective MMR pathway in cancer cells with microsatellite instability increases the rates of microsatellite deletions and insertions and of single nucleotide mutations. Even though most microsatellite loci are located in intergenic regions, a significant number of microsatellites (505,657 microsatellites) are found in the expressed transcripts from RefSeq. We generated a single-cell full-length cDNA library of 5,847 HCT116 cancer cells using the 10X Chromium Single Cell 3' v3 kit. The library was sequenced on an ONT long-read sequencer with a single PromethION 10.4 flowcell (FLO-MIN112) using the LSK112 protocol. 
We developed a comprehensive pipeline for detecting variable microsatellite alleles from single-cell long-read RNA-sequencing data. We observed that HCT116 cancer cells have variable microsatellite alleles at several hundred microsatellite loci, including the locus (chr17:28,723,900-28,723,925) in the RPL23A transcript. Our analysis pipeline can be utilized to select informative microsatellite loci to increase the cost-effectiveness of retrospective lineage tracing of single cancer cells using long-read RNA-sequencing.
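
To make the idea of "variable microsatellite alleles" concrete, the following is a minimal sketch, not the authors' pipeline: it assumes that reads overlapping one locus have already been grouped by cell barcode, and it calls each cell's allele as the most common tandem-repeat count. The function names, the example reads and the `min_repeats` threshold are hypothetical.

```python
import re
from collections import Counter


def repeat_count(read: str, unit: str, min_repeats: int = 3) -> int:
    """Longest run of tandem `unit` repeats in `read`; 0 if no run reaches min_repeats."""
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_repeats},}}")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(read)]
    return max(runs, default=0)


def allele_table(reads_by_cell, unit):
    """Map each cell barcode to its most frequently observed repeat count."""
    table = {}
    for cell, reads in reads_by_cell.items():
        counts = Counter(repeat_count(r, unit) for r in reads)
        counts.pop(0, None)  # ignore reads without a detectable repeat run
        if counts:
            table[cell] = counts.most_common(1)[0][0]
    return table


# Hypothetical reads covering a (CA)n microsatellite in two cells.
reads_by_cell = {
    "cell_A": ["GGTACACACACACATT", "GGTACACACACACATT"],
    "cell_B": ["GGTACACACACATT", "GGTACACACACATT", "GGTACACACACACATT"],
}
alleles = allele_table(reads_by_cell, "CA")
```

Loci where cells carry different repeat counts (here 5 versus 4) are the ones that could be informative for lineage tracing; a real pipeline would also need to handle sequencing errors in homopolymers and repeats, which this sketch ignores.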

16: Using CRISPR deletion and enhancer activity data to identify transcriptional regulatory motifs in non-coding sequence and learn the enhancer sequence code
COSI: rsg
  • Andrew Duncan, University of Toronto, Canada
  • Shanelle Mullany, University of Toronto, Canada
  • Alan Moses, University of Toronto, Canada
  • Jennifer Mitchell, University of Toronto, Canada


Presentation Overview: Show

Non-coding transcriptional enhancers are critical for development and for phenotype divergence during evolution, and are often mutated in disease contexts; however, even in well-studied cell types, the sequence code conferring enhancer activity remains unknown. Enhancers are key drivers of pluripotency maintenance and the reprogramming process; therefore, determining the repertoire of sequences and transcription factors that confer activity to these regions will provide a better understanding of the pluripotent state and reveal transcriptional control mechanisms that define cell identity. CRISPR deletion of regions identified as enhancers based on chromatin features (histone modification or transcription factor binding) reveals our knowledge gaps, as only some display regulatory activity in the genome. We initially used comparative epigenomics to identify conserved enhancers in naïve mouse and human embryonic stem cells. Machine learning revealed that these sequences are enriched in a conserved repertoire of 70 different transcription factor binding site sequences (TFBS), including those that bind known and novel pluripotency regulators. Next, we trained a neural network to predict enhancer activity from DNA sequence in mouse embryonic stem cells using genome-wide STARR-seq data. We found that, similar to the comparative epigenomics approach, the STARR-seq-trained neural network attributes importance to many well-known pluripotent state TFBS, such as those for OCT4, SOX2, and KLF4. In addition, this approach identified an extended repertoire of motifs that contribute to enhancer activity. These findings break from a widespread belief that a select few master regulators are sufficient to explain enhancer activity and are consistent with our experimental observations regarding the importance of TFBS diversity in both native and synthetic enhancers. 
By comparison with these experimental data, we were also able to show that the neural network identifies experimentally validated binding sites for SP1, ESRRB and SOX3 as important for activity in one of the native SOX2 enhancers, demonstrating that the neural network is learning genuine enhancer features. Finally, we construct synthetic sequences designed to extract the features and rules that are learned by the neural network and compare the findings to experimental data and native genome sequences. This combined and iterative approach is revealing new complexity in the enhancer regulatory code and identifying both the ways in which machine learning informs our understanding and how this approach can be limited without accompanying in-genome experimental data.

17: scHiMe: Predicting single-cell DNA methylation levels based on single-cell Hi-C data
COSI: rsg
  • Hao Zhu, University of Miami, United States
  • Tong Liu, University of Miami, United States
  • Zheng Wang, University of Miami, United States


Presentation Overview: Show

Recently a biochemistry experiment was developed to simultaneously capture the chromosomal conformations and DNA methylation levels on single cells. A computational tool to predict single-cell methylation levels based on single-cell Hi-C data becomes necessary due to the availability of this experiment. scHiMe was developed to predict the base-pair-specific methylation levels in the promoter regions genome-wide based on the single-cell Hi-C data and DNA nucleotide sequences using the graph transformer algorithm. Promoter-promoter spatial interaction networks were built based on single-cell Hi-C data, and single-cell DNA methylation levels on 1000 base pairs for each promoter were predicted based on the network topology and DNA sequence. Our evaluation results showed a high consistency between the predicted and the true methylation values. We tested using predicted DNA methylation levels on all promoters to classify cells into different cell types, and our results showed that the predicted DNA methylation levels resulted in almost perfect cell-type classification, which indicated that our predictions maintained the cell-to-cell variability. We also tested using the predicted DNA methylation levels of different subsets of promoters and different subsets of CpGs in promoters to classify cells and provided the promoters and CpGs that were most influential in cell-type clustering. Moreover, we observed slightly better performance for the nodes that have higher degree values in the promoter-promoter spatial interaction network but did not find a similar trend for more significant network influencers. Last but not least, we found that using the predicted methylation levels of only housekeeping genes led to less accurate cell-type clustering, which demonstrated that our methylation predictions fit the biological meanings of housekeeping genes since housekeeping genes usually have constant and similar genetic and epigenetic features among different types of cells. 
scHiMe is freely available at http://dna.cs.miami.edu/scHiMe/.
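
As a rough illustration of what a promoter-promoter spatial interaction network and its node degrees look like, here is a minimal sketch. It is not scHiMe's code: the bin-to-promoter mapping, the contact pairs and both function names are hypothetical, and it assumes Hi-C contacts are given as pairs of genomic bin indices.

```python
from collections import defaultdict


def promoter_network(contacts, bin_to_promoter):
    """Build undirected promoter-promoter edges from Hi-C bin contacts."""
    edges = set()
    for bin_a, bin_b in contacts:
        p_a = bin_to_promoter.get(bin_a)
        p_b = bin_to_promoter.get(bin_b)
        # Keep only contacts whose two ends both fall in distinct promoters.
        if p_a is not None and p_b is not None and p_a != p_b:
            edges.add(frozenset((p_a, p_b)))
    return edges


def degrees(edges):
    """Count how many distinct promoters each promoter contacts."""
    deg = defaultdict(int)
    for edge in edges:
        for node in edge:
            deg[node] += 1
    return dict(deg)


# Hypothetical single-cell contacts (bin index pairs) and promoter annotation.
contacts = [(1, 5), (5, 1), (1, 9), (2, 3)]
bin_to_promoter = {1: "P1", 5: "P2", 9: "P3"}
edges = promoter_network(contacts, bin_to_promoter)
node_degree = degrees(edges)
```

In this toy network, P1 has degree 2 and P2 and P3 have degree 1; the abstract's observation is that prediction performance was slightly better for higher-degree nodes.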

18: Hyper-methylation of ABCG1 as an Epigenetic Biomarker in Non-small Cell Lung Cancer
COSI: rsg
  • Thi-Oanh Tran, Taipei Medical University, Taiwan
  • Ho Thanh Lam Luu, Taipei Medical University, Taiwan
  • Nguyen Quoc Khanh Le, Taipei Medical University, Taiwan


Presentation Overview: Show

Non–small cell lung cancer (NSCLC) is the most prevalent histological type of lung cancer and the leading cause of cancer death globally. Patients with NSCLC have a poor prognosis for a variety of reasons, one of which is late diagnosis. DNA methylation of CpG island sequences in the promoter regions of tumor suppressor genes has recently received attention as a potential biomarker of human cancer. In this study, we investigate DNA methylation changes in NSCLC patients in the adenosine triphosphate (ATP)-binding cassette transporter G1 (ABCG1), a member of one of the largest and possibly oldest gene families, the ATP-binding cassette transporters. Recent studies have reported ABCG1 genetic variants and aberrant expression of ABCG1 protein in lung cancer. However, the mechanism underlying the abnormal expression of this protein in NSCLC is still unclear. Using a bioinformatic approach, our study demonstrates that ABCG1 is hypermethylated in NSCLC samples and that these changes are negatively correlated with gene expression. Furthermore, the expression of the ABCG1 gene is significantly associated with the survival rate of lung adenocarcinoma (LUAD) patients; however, it did not show a correlation with overall survival (OS) of lung squamous cell carcinoma (LUSC) patients. A novel finding of this study is that ABCG1 hypermethylation at locus cg20214535 may be a potential epigenetic biomarker for targeted therapy in LUAD, as the probe-level DNA methylation of ABCG1 is strongly related to the OS rate of LUAD patients. At the protein level, we found that ABCG1 consistently shows weak expression in NSCLC tissue. In addition, this study illustrates the protein-protein interactions (PPI) of ABCG1 with other proteins and the strong communication of ABCG1 with immune cells. In summary, the ABCG1 epigenetic phenotype is a novel biomarker for the prognosis of NSCLC and a potential target in the treatment of lung cancer patients.

19: Transcriptional signatures of cell-cell interactions are dependent on cellular context
COSI: rsg
  • Brendan Innes, University of Toronto, Canada
  • Gary Bader, University of Toronto, Canada


Presentation Overview: Show

Cell-cell interactions are often predicted from single-cell transcriptomics data based on observing receptor and corresponding ligand transcripts in cells. These predictions could theoretically be improved by inspecting the transcriptome of the receptor cell for evidence of gene expression changes in response to the ligand. It is commonly expected that a given receptor, in response to ligand activation, will have a characteristic downstream gene expression signature. However, this assumption has not been well tested. We used ligand perturbation data from both the high-throughput Connectivity Map resource and published transcriptomic assays of cell lines and purified cell populations to determine whether ligand signals have unique and generalizable transcriptional signatures across biological conditions. Most of the receptors we analyzed did not have such characteristic gene expression signatures – instead these signatures were highly dependent on cell type. Cell context is thus important when considering transcriptomic evidence of ligand signaling, which makes it challenging to build generalizable ligand-receptor interaction signatures to improve cell-cell interaction predictions.

20: Clustering of human Micro-C and chromatin state data identifies chromosome conformation signatures in cell types
COSI: rsg
  • Corinne Sexton, University of Nevada Las Vegas, United States
  • Mira Han, University of Nevada Las Vegas, United States


Presentation Overview: Show

Chromatin states based on chromatin mark ChIP-Seq datasets are a common annotation for genomes. Software tools such as ChromHMM and Segway employ Hidden Markov Models to learn patterns among chromatin marks, which have subsequently been shown to correspond to well-known regulatory annotations such as enhancers and transcription start sites (TSS). With the advent of Hi-C and other chromosome conformation capture sequencing derivatives, we now have the ability to analyze 2-dimensional physical TSS-enhancer interactions, rather than just the 1-dimensional regulatory annotation.

However, the patterns of chromosome conformation itself have not been well explored. We propose an unsupervised clustering approach using the k-prototypes algorithm to cluster Micro-C contacts for each annotated chromatin state to discover chromosome conformation patterns. We performed our clustering analysis with data from H1, HFF, HeLa-S3, and Definitive Endoderm (DE) cells.
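
For readers unfamiliar with k-prototypes, the sketch below shows its core idea: the dissimilarity between a data point and a cluster prototype is the squared Euclidean distance over numeric features plus a weight `gamma` times the number of mismatched categorical features. This is a generic illustration of the assignment step only, not the authors' code; the feature layout (two numeric contact features and one chromatin-state label) and the value of `gamma` are assumptions.

```python
def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """k-prototypes cost: squared Euclidean (numeric) + gamma * mismatches (categorical)."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    categorical = sum(1 for x, y in zip(cat_a, cat_b) if x != y)
    return numeric + gamma * categorical


def assign(points, prototypes, gamma=1.0):
    """Assign each (numeric, categorical) point to its closest prototype."""
    labels = []
    for num, cat in points:
        costs = [
            mixed_dissimilarity(num, cat, p_num, p_cat, gamma)
            for p_num, p_cat in prototypes
        ]
        labels.append(costs.index(min(costs)))
    return labels


# Hypothetical contacts: (numeric contact features, chromatin state of the partner).
points = [
    ((0.9, 0.8), ("bivalent_enhancer",)),
    ((0.1, 0.2), ("weak_enhancer",)),
    ((0.8, 0.9), ("bivalent_tss",)),
]
prototypes = [
    ((1.0, 1.0), ("bivalent_enhancer",)),
    ((0.0, 0.0), ("weak_enhancer",)),
]
labels = assign(points, prototypes, gamma=0.5)
```

A full k-prototypes run alternates this assignment with updating each prototype (mean of the numeric features, mode of the categorical ones) until assignments stop changing.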

Our results indicate that certain chromatin states interact in specific patterns. For example, bivalent enhancers mainly contact other bivalent enhancers or bivalent transcription start sites, and flanking transcription start sites associate mainly with weak enhancers.

Additionally, most chromosome conformation patterns are shared between the four cell types used here, with the notable exception of the bivalent cluster which is absent in HeLa cells only.

Between 5 and 30% of the regions change cluster membership between the four cell types, hinting at the dynamic nature of chromosome conformation and the necessity of cell-type-specific folding. Overall, we present the first set of chromosome conformation signatures, which will assist researchers with the interpretation of 2D genome regulation.

21: Novel GEM Workflow Unveiling Cellular Metabolism from Gene Expression Data
COSI: rsg
  • Herbert Yao, Queen's University at Kingston, Canada
  • Sanjeev Dahal, Queen's University at Kingston, Canada
  • Laurence Yang, Queen's University at Kingston, Canada


Presentation Overview: Show

Gene expression data of cell cultures is commonly measured in biological and medical studies to understand cellular decision-making in various conditions. Metabolism, affected but not solely determined by the expression, is much more difficult to measure experimentally. Thus, finding a reliable method to predict cell metabolism for given expression data will greatly benefit model-aided cell engineering studies including cellular production of chemicals, foods, and therapeutics.
We have developed a pipeline that can explore cellular fluxomics from expression data in an unbiased manner, starting from only a high-quality genome-scale metabolic model. This is done in two main steps. First, a protein-constrained metabolic model is constructed by integrating protein and enzyme information into the basic metabolic model, which further constrains the network. Second, the expression data are overlaid onto the modified model using a new two-step nonconvex and convex optimization formulation, resulting in context-specific models with optionally calibrated rate constants. The resulting model computes proteomes and intracellular flux states that are consistent with the measured transcriptomes. Therefore, it provides detailed cellular insights that are difficult to glean from the omic data or metabolic models alone.
As a case study, we apply the pipeline to interpret triacylglycerol (TAG) overproduction by Chlamydomonas reinhardtii, using time-course RNA-Seq data. The pipeline allows us to compute C. reinhardtii metabolism under nitrogen deprivation and metabolic shifts after an acetate boost. We also suggest a list of possible ‘bottlenecking’ proteins that need to be overexpressed to increase the TAG accumulation rate, as well as other TAG-overproduction strategies. The code is available open source with detailed documentation and is expected to aid cell engineering by providing intracellular metabolic insights for any measured transcriptome.

22: Multi-modal inference of phenotype-relevant transcriptional regulatory networks in embryonic development using InPheRNo-ChIP
COSI: rsg
  • Chen Su, McGill University, Canada
  • William Pastor, McGill University, Canada
  • Amin Emad, McGill University, Canada


Presentation Overview: Show

Authors: Chen Su, Dr. William A. Pastor and Dr. Amin Emad

The lineage specification at the very early stage of human embryogenesis is of specific interest in developmental biology, and the underlying regulatory mechanism remains to be further clarified. The emergence of high-throughput sequencing technologies and progress in developing in-vitro models for mammalian embryos enable researchers to study previously inaccessible biological processes, such as organogenesis and lineage formation. To elucidate the mechanism involved in the differentiation of human embryonic stem cells (hESCs) to endoderm (dEN), we developed a computational framework, InPheRNo-ChIP, which integrates transcriptomic data from three independent studies, with genome-wide DNA–protein interactions and phenotypic labels (representing different stages of differentiation) to reconstruct a transcriptional regulatory network (TRN) involved in this process.

InPheRNo-ChIP extends the original InPheRNo framework to not only incorporate multi-omics data from various sources, but also reconstruct TRNs relevant to phenotypic variations of interest. It first estimates two sets of p values of TF-gene associations and one set of p values of gene-phenotype associations from integrated RNA-seq and ChIP-seq analysis. It then utilizes a carefully designed probabilistic graphical model (PGM) to model those summary statistics and generate an initial bipartite graph. Several normalization steps and comparisons with different controls follow to identify TF-gene relationships that are associated with the differentiation process.

Our analysis showed that InPheRNo-ChIP can recover both novel and known regulatory mechanisms of endoderm formation in the context of hESC differentiation. We validated the inferred network in an in-vitro scRNA-seq-based CRISPRi screening dataset involving multiple molecular drivers of human endoderm differentiation. Notably, focusing on the list of target genes of 3 TFs, FOXA2, SOX17, and SMAD2, which are known to be involved in endoderm differentiation, we found that genes identified by InPheRNo-ChIP are significantly enriched for marker genes associated with early endoderm formation processes in the CRISPRi study. We also validated the list of targets for all TFs of interest using external databases such as GTRD and LINCS. Overall, this study identified a core set of TF-gene edges for endoderm formation. Moreover, we showed that InPheRNo-ChIP can unravel the transcriptional mechanisms of cell differentiation in the course of human embryogenesis.

23: Community Detection Analysis in Multilayer COVID-19 Patient Similarity Networks
COSI: rsg
  • Piotr Sliwa, University of Oxford, United Kingdom
  • Heather Harrington, University of Oxford, United Kingdom
  • Gesine Reinert, University of Oxford, United Kingdom
  • Julian C. Knight, University of Oxford, United Kingdom


Presentation Overview: Show

Developments in experimental biology have enabled the collection of multiple molecular modalities per patient for large cohorts and exacerbated the need to develop and apply methods to integrate such datasets. Multiple ideas have been proposed for using networks towards that aim. For example, Similarity Network Fusion, which creates sample similarity networks for all data types and then fuses them into a final network that combines all the available information, has been successfully applied in the analysis of cancer datasets.

Instead, here we construct patient similarity networks in a principled fashion, one network per modality. Individual networks are created via a resampling approach, COGENT, which ensures that the combination of parameters we use results in the most robust single-modality networks. We then combine the similarity networks into a multilayer network by coupling the different layers according to the level of information shared between them. Next, we run the Leiden algorithm to discover the groupings of patients and explore the resulting communities for correlations with important clinical and molecular variables. The underlying dataset is a recent comprehensive multimodal dataset containing, among others, information on COVID-19 and sepsis patients and healthy volunteers, spanning modalities such as single-cell and bulk gene expression levels, plasma and serum proteomics, and cell composition measured by mass cytometry, along with detailed clinical information. Through these network approaches, we integrate the data to identify 5 communities. We assess their stability and find that they partition the patients into groups that recover the major clinical diagnosis (outpatient, COVID-19 or sepsis) in an unsupervised way. The method refines this partition by further dividing the patients into subsets with significantly different relevant clinical features (including O2/FiO2 ratio, length of hospital stay, Sequential Organ Failure Assessment (SOFA) score, SOFA oxygenation score) as well as activities of biological pathways (regulation of complement cascade, interleukin signaling, Toll-like Receptor cascade). Our pipeline can also be applied to other similar datasets and may help to better understand multimodal molecular health data.
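
One common way to represent a multilayer network with coupled layers is a supra-adjacency matrix: each layer's patient similarity matrix sits on the block diagonal, and each patient is linked to their own copy in every other layer with a layer-pair-specific weight. The sketch below is a generic illustration under that assumption, not the authors' implementation; the function name and the coupling values are hypothetical, and community detection (for example with the Leiden algorithm) would be run on the resulting network in a separate step.

```python
def supra_adjacency(layers, coupling):
    """Combine per-modality similarity matrices into one supra-adjacency matrix.

    layers: list of n x n similarity matrices, one per modality.
    coupling[a][b]: weight linking the same patient between layers a and b.
    """
    n_layers = len(layers)
    n = len(layers[0])
    size = n_layers * n
    supra = [[0.0] * size for _ in range(size)]
    # Intra-layer edges: copy each similarity matrix onto the block diagonal.
    for a, layer in enumerate(layers):
        for i in range(n):
            for j in range(n):
                supra[a * n + i][a * n + j] = layer[i][j]
    # Inter-layer edges: connect each patient to their copy in other layers.
    for a in range(n_layers):
        for b in range(n_layers):
            if a != b:
                for i in range(n):
                    supra[a * n + i][b * n + i] = coupling[a][b]
    return supra


# Two hypothetical modalities measured on the same two patients.
layers = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.0, 0.5], [0.5, 0.0]],
]
coupling = [[0.0, 0.3], [0.3, 0.0]]
supra = supra_adjacency(layers, coupling)
```

Setting the coupling weight per pair of layers is what lets layers that share more information pull their patient copies more strongly into the same community.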

24: Intelligent personalized risk prediction for intraductal papillary mucinous neoplasms of pancreas
COSI: rsg
  • Nam Nguyen, Moffitt Cancer Center, United States
  • Jason Fleming, Moffitt Cancer Center, United States
  • Patricia McDonald, Moffitt Cancer Center, United States
  • Kwang-Cheng Chen, University of South Florida, United States
  • Aleksandra Karolak, Moffitt Cancer Center, United States


Presentation Overview: Show

Intraductal papillary mucinous neoplasms (IPMN) are a common precursor for invasive pancreatic ductal adenocarcinoma - each IPMN tumor carries a 10-25% risk of progressing to cancer within 10 years of detection. Although IPMNs are precursors of pancreatic cancer, there is a lack of validated intervention targets for precision prevention of IPMN. Currently, cancer prevention for patients with IPMN is centered on surgery using a risk-tailored approach; the clinical ability to stratify the risk of cancer progression of individual IPMN tumors is poor, and essentially no effective non-surgical interventions exist. Identification of targets for intervention to prevent cancer progression is desperately needed. To advance precision cancer prevention for IPMN, we are developing an intelligent, machine learning (ML)-derived framework addressing the challenge of IPMN stratification from samples with limited data space, e.g., mutations and gene expression only. Our study incorporates text-based deep learning of the mutational profiles of patients’ genomes, followed by an integration of codon-sequence-derived mutational features for biomarker validation. Given the potential of genome spatial organization to help explain mutagenesis and instability, we also map the DNA sequence information onto the local spatial genome organization to elucidate underlying molecular mechanisms and evaluate the clinical significance of the key codon mutations and pathways. Our results show that information extracted from mutational profiles with the proposed methods significantly improves the performance of the ML models in stratification of IPMN and biomarker identification. We present how the use of advanced ML tools for mutational data, spatial genome organization, and functional analyses from tumor samples can aid understanding of the underlying mechanisms and of the risk of IPMN progression to pancreatic cancer.

26: Chromatin state annotation of cis-regulatory elements in hundreds of human cell types using Segway with applications to disease association
COSI: rsg
  • Marjan Farahbod, Simon Fraser University, Canada
  • Paul Sud, Stanford University, United States
  • Mehdi Foroozandeh, Simon Fraser University, Canada
  • Abdul Rahman Diab, Simon Fraser University, Canada
  • Ishan Goel, Simon Fraser University, Canada
  • Benjamin Hitz, Stanford University, United States
  • J. Michael Cherry, Stanford University, United States
  • Maxwell Libbrecht, Simon Fraser University, Canada


Presentation Overview: Show

Genome-wide association studies (GWAS) have identified tens of thousands of genetic variants associated with human diseases and traits, but the vast majority of these associations are not backed by a hypothesized mechanism. Understanding such disease-associated variants is hampered by the incompleteness of annotations of the cis-regulatory elements (CREs) driving disease association.
To improve our annotation of regulatory elements, large-scale projects such as ENCODE have recently engaged in epigenome mapping. A primary outcome of such mapping projects is a set of chromatin state annotations created using segmentation and genome annotation (SAGA) algorithms such as Segway or ChromHMM. These methods take as input a collection of epigenomic tracks for a given cell type, such as histone modifications and DNA accessibility, and output high-resolution annotations that assign each genomic region to a category of activity such as “enhancer” or “promoter”. Segway performs this task in two steps. In the first step, Segway performs unsupervised high-resolution partitioning of the genome into several chromatin states based on the input epigenomic tracks and assigns each segment a state label corresponding to a type of genomic activity. In the second step, an automated process assigns human-understandable interpretation terms to each state label.
Here we present Segway annotations resulting from all epigenomic data sets collected by ENCODE. These annotations identify all observed cis-regulatory elements in the genome along with their pattern of activity across cell types. Due to the unsupervised nature of these annotations, they reveal novel patterns of epigenetic activity present only in a subset of cell types. We demonstrate that these patterns correspond to functional activity specific to such cell types, demonstrating the importance of comprehensive epigenomic maps.
We demonstrate that these annotations accurately identify the putatively causal element driving disease association within each GWAS-identified locus. We further link each such causal element to the tissues affected by each disease and link to the genes it regulates. Thus, these chromatin state annotations provide an integrative and intuitive way to understand the landscape of disease-associated cis-regulatory elements across hundreds of human cell types.

27: The Nuts and Bolts of NIH Peer Review
COSI: rsg
  • James Li, NIH Center for Scientific Review, United States


Presentation Overview: Show

This poster is intended for new and established investigators who wish to become more familiar with the National Institutes of Health (NIH) peer review process. The mission of the Center for Scientific Review at NIH is “to see that NIH grant applications receive fair, independent, expert, and timely scientific reviews - free from inappropriate influences - so that NIH can fund the most promising research.” This poster presents an overview of the NIH peer review process, the timeline from application submission to post review, and the types of applications reviewed. Guidance will be provided on how to identify the study section in which an application is to be reviewed.

28: scDisInFact: disentangled learning for integration and imputation of multi-condition single cell RNA-sequencing data
COSI: rsg
  • Ziqi Zhang, Georgia Institute of Technology, United States
  • Xinye Zhao, Georgia Institute of Technology, United States
  • Xiuwei Zhang, Georgia Institute of Technology, United States


Presentation Overview: Show

As more and more scRNA-seq data become available, data integration methods have been developed to integrate multiple datasets that measure the same tissues, with the goal of integrating the cells into a shared space while removing the batch effects. However, in studies that consider data from multiple biological conditions, the difference between cell batches is not just the result of batch effects, but also reflects the various biological conditions of the donors (age, disease severity, gender, drug treatments, etc.).

Most of the existing methods conduct integration by forcing the latent distribution of cells to “match” between cell batches, which fails to consider the biological variation between cell batches. Ideally, we should remove only the batch effects (which is unwanted technical variation), but preserve the biological variation between batches. We propose scDisInFact (single cell Disentangled Integration preserving condition-specific Factors), a method that integrates cell batches while keeping the batch-specific biological information. scDisInFact is not only an integration tool. It has the following functions:

1. Using the known experimental conditions and batch identities of each scRNA-seq matrix as supervision, scDisInFact is able to disentangle the variation within and across the data matrices into shared biological factors among batches, batch-specific unshared biological factors (representing the condition-specific effects), and the technical batch effect, which is removed.

2. Using a feature selection module, scDisInFact can identify the key genes that are associated with cell response under various biological conditions.

3. Through latent space vector arithmetic, scDisInFact is able to impute the batch-effect-removed gene expression data of every cell under all biological conditions that are included in the input dataset. That is, for a given batch of cells from a certain condition, scDisInFact can predict single cell gene expression data of a different condition.
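
The latent space vector arithmetic in item 3 can be illustrated with a generic sketch: a cell's condition-specific latent vector is shifted by the difference between the mean latent vectors of the target and source conditions, and the shifted vector would then be passed to a decoder. This is an assumption-laden illustration of the general technique, not scDisInFact's implementation; the function names and latent values are hypothetical, and the decoder is omitted.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length latent vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def shift_condition(z, source_mean, target_mean):
    """Move latent vector z from the source condition towards the target condition."""
    return [zi - s + t for zi, s, t in zip(z, source_mean, target_mean)]


# Hypothetical condition-specific latent factors of cells in two conditions.
control = [[0.0, 1.0], [0.2, 1.2]]
treated = [[2.0, 1.0], [2.2, 1.4]]

z_cell = [0.1, 1.1]  # a control cell
z_pred = shift_condition(z_cell, mean_vector(control), mean_vector(treated))
```

Decoding `z_pred` together with the cell's unchanged shared biological factors, and without the batch factor, would give the predicted expression of that cell under the other condition.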

scDisInFact uses a deep learning framework based on a supervised disentangled variational autoencoder. We tested scDisInFact on simulated and real datasets. We measured the biological factor and batch effect disentanglement, data imputation, and key gene discovery accuracy of scDisInFact using various benchmarking metrics, and compared its performance with baseline methods. The results show superior performance of scDisInFact in scRNA-seq integration and biological factor disentanglement. The detected condition-associated genes are more accurate than genes obtained from a differential expression analysis between data from different conditions. scDisInFact can be used to comprehensively analyze scRNA-seq data with multiple conditions.

29: nf-core/circrna: A workflow for the quantification, miRNA target prediction and differential expression analysis of circRNAs
COSI: rsg
  • Barry Digby, National University of Ireland, Galway, Ireland
  • Stephen Finn, Trinity College Dublin, Ireland
  • Pilib O Broin, National University of Ireland, Galway, Ireland


Presentation Overview: Show

Circular RNAs (circRNAs) are a class of covalently closed non-coding RNAs (ncRNAs) that have garnered increased attention from the research community due to their stability, tissue-specific expression and role as transcriptional modulators via sequestration of miRNAs. Currently, multiple quantification tools capable of detecting circRNAs exist, yet none delineate circRNA-miRNA interactions, and only one employs differential expression analysis. Efforts have been made to bridge this gap by way of circRNA workflows; however, these workflows are limited by both the types of analyses available and the computational skills required to run them.

We present nf-core/circrna, a multi-functional, automated, high-throughput pipeline implemented in Nextflow that allows users to fully characterise the role of circRNAs in RNA sequencing datasets via three analysis modules: (i) circRNA quantification, robust filtering and annotation; (ii) miRNA target prediction of the mature spliced sequence; and (iii) automated differential expression analysis. nf-core/circrna has been developed within the nf-core framework, ensuring robust portability across computing environments via containerisation, parallel deployment on cluster/cloud-based infrastructures, comprehensive documentation and maintenance support.

nf-core/circrna reduces the barrier to entry for researchers by providing an easy-to-use, comprehensive, platform-independent and scalable workflow for circRNA analyses. Source code, documentation and installation instructions are freely available at https://nf-co.re/circrna and https://github.com/nf-core/circrna.

30: scSTEM: clustering pseudo-time ordered single cell data
COSI: rsg
  • Qi Song, Carnegie Mellon University, United States
  • Ziv Bar-Joseph, Carnegie Mellon University, United States
  • Jingtao Wang, McGill University, Canada


Presentation Overview: Show

Much attention in the analysis of single cell data has focused on modeling trajectories of cell development and differentiation. Much less work in this area has focused on the analysis of genes for a given path in a reconstructed trajectory. Existing methods mainly focus on clustering genes and do not fully utilize the dynamic information obtained from time series scRNA-Seq studies or from trajectory inference methods. Such information may be key to understanding cell fate determination by specific functional gene clusters.
Fifteen years ago, we noticed a similar problem when analyzing bulk (microarray) expression data. For bulk data, most time-series datasets were very short, making it hard to distinguish significant trends from random noise, which led to the use of general, rather than time-series, clustering methods. To address this, we developed the Short Time Series Expression Miner (STEM). STEM is now the most popular clustering method for time-series expression data and its usage has roughly doubled every four years since its publication, reaching more than 200 citations in 2022 (unusual for a paper published in 2006). However, STEM is not appropriate for single cell data.
To enable the use of STEM and its underlying significance assignment method for single cell data, we developed scSTEM, which clusters dynamic profiles of genes in trajectories inferred from pseudotime ordering of single cell RNA-seq (scRNA-seq) data. scSTEM first uses one of several pseudotime inference methods to construct a trajectory for a given scRNA-Seq dataset. Next, for every gene in every connected component of the trajectory graph, scSTEM generates summary time series data using several different approaches. These data are then used as input for STEM, and clusters are determined for each path in the trajectory. Users can also compare STEM clusters between two branches of a trajectory tree to identify genes and biological processes that led to the divergence of these branches. We compared scSTEM to several other clustering methods and showed that scSTEM improves the identification of functionally relevant clusters while scaling well to large datasets. In addition, comparisons of different trajectory branches using scSTEM provide biological insights about the activity of different cell types, including clusters distinguishing between very similar cell types such as T cells and NK cells. To facilitate its use, we provide scSTEM as an R Shiny app-based GUI tool at: https://github.com/alexQiSong/scSTEM.

31: Detecting differential splicing from smart-seq2 based scRNA-seq data
COSI: rsg
  • Jelard Aquino, University of Nevada, Las Vegas, United States
  • Mira Han, University of Nevada, Las Vegas, United States
  • Amei Amei, University of Nevada, Las Vegas, United States
  • Steve Park, University of Nevada, Las Vegas, United States


Presentation Overview: Show

Alternative splicing is a critical post-transcriptional regulatory process that generates multiple mature mRNAs from a single gene, significantly increasing transcriptomic and proteomic diversity in eukaryotic cells. To date, characterizing alternative splicing events at single-cell resolution remains a challenging task due to the low sequencing depth and high technical variability in scRNA-seq data. Because of this, it is difficult to correctly quantify RNA isoform expression and detect differential splicing in single cells. New methods have been developed to detect differential alternative splicing events in scRNA-seq data, and they fall into two categories: (1) splice-junction-based approaches and (2) exon-fragment-based approaches. In this study, we simulated scRNA-seq data generated from the Smart-seq2 protocol to determine an optimal strategy for detecting differential splicing in scRNA-seq by comparing the splice-junction-based and exon-fragment-based approaches. Our preliminary results showed that the exon-fragment-based approach achieves higher sensitivity, while the splice-junction-based approach has higher specificity. In addition, we discovered that pooling cells achieves both higher specificity and higher sensitivity than keeping the cells as separate replicates. Our findings serve as a basis for developing a new method for detecting differential splicing in scRNA-seq data.

32: Stage-Specific Modular and Molecular Network Analyses in Colorectal Cancer
COSI: rsg
  • Sara Rahiminejad, University of California, San Diego, United States
  • Mano Maurya, University of California, San Diego, United States
  • Kavitha Mukund, University of California, San Diego, United States
  • Shankar Subramaniam, University of California, San Diego, United States


Presentation Overview: Show

Although mechanisms contributing to the progression and metastasis of colorectal cancer (CRC) are well studied, stage-specific mechanisms have been less comprehensively explored. This is the focus of this study. Using previously published data for CRC (Gene Expression Omnibus ID GSE21510), we identified differentially expressed genes (DEGs) across four stages of the disease. We then generated unweighted and weighted correlation networks for each stage separately. Communities (also called modules) within these networks were detected using the Louvain algorithm and compared topologically and functionally across stages using the normalized mutual information (NMI) metric and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment, respectively. We used the Short Time-series Expression Miner (STEM) algorithm to detect potential biomarkers having a role in CRC. A total of 16,062 DEGs were identified between various stages (p-value ≤ 0.05). Comparing communities of different stages revealed that neighboring stages were more similar to each other than non-neighboring stages, at both topological and functional levels. A functional analysis of 24 cancer-related pathways indicated that several signaling pathways were enriched across all stages. However, the stage-unique networks were distinctly enriched only for a subset of these 24 pathways (e.g., MAPK signaling pathway in stages I-III and Notch signaling pathway in stages III and IV). We identified potential biomarkers, including HOXB8 and WNT2 with increasing trends and MTUS1 and SFRP2 with decreasing trends from stages I to IV. Extracting subnetworks of 10 cancer-relevant genes and their interacting first neighbors (162 genes in total) using STRING-db revealed that the connectivity patterns for these genes were different across stages. For example, BRAF and CDK4, members of the Ser/Thr kinase family that are up-regulated in cancer, displayed changing connectivity patterns from stage I to IV.
Using approved drugs and their target genes from the National Cancer Institute (NCI) and DrugBank databases, we constructed a Drug-Target-PPI network and observed that the weight of each target gene, defined as the sum of the weights of the edges connected to it, changed extensively across the four stages. In particular, we saw that TYMS, a target for some drugs such as Fluorouracil Injection (5-FU), was up-regulated in cancer stages, with larger weights in stage III than in other stages. Overall, we provided a pseudo-temporal view of the mechanistic changes associated with CRC by analyzing molecular and modular networks at various stages of the disease. Our findings highlighted similarities at both functional and topological levels across stages. We further identified stage-specific mechanisms and biomarkers potentially contributing to the progression of CRC.
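The target gene weight described above (the sum of the weights of edges connected to a node) can be sketched as follows. The edge lists, neighbor names, and weight values are hypothetical and chosen only to show the calculation; they are not taken from the study.

```python
from collections import defaultdict

def node_strength(edges):
    """Sum the weights of all edges incident to each node (weighted degree)."""
    strength = defaultdict(float)
    for u, v, weight in edges:
        strength[u] += weight
        strength[v] += weight
    return dict(strength)

# Hypothetical stage-specific edge lists: (node, node, edge weight).
stage_edges = {
    "stage_II": [("TYMS", "gene_a", 0.5), ("TYMS", "gene_b", 0.25)],
    "stage_III": [("TYMS", "gene_a", 0.75), ("TYMS", "gene_b", 0.5),
                  ("TYMS", "gene_c", 0.25)],
}

# Weight of one target gene in each stage-specific network.
tyms_by_stage = {
    stage: node_strength(edges)["TYMS"] for stage, edges in stage_edges.items()
}
```

Comparing the per-stage values for one gene is then a dictionary lookup, which makes it easy to tabulate how a drug target's weight changes across stages.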

33: SEM: size-based expectation maximization for characterizing nucleosome positions and types
COSI: rsg
  • Jianyu Yang, Penn State University, United States
  • Shaun Mahony, Penn State University, United States
  • Kuangyu Yen, Institute of Hematology & Blood Diseases Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, China


Presentation Overview: Show

Genome-wide nucleosome positions are most popularly characterized using a combination of micrococcal nuclease and high-throughput sequencing (MNase-seq). MNase-seq typically shows the existence of nucleosome-free regions (NFRs) upstream of transcription start sites (TSSs). However, recent studies have proposed that a special type of “fragile” nucleosome resides in MNase-seq-defined NFRs. Fragile nucleosomes are proposed to protect relatively short DNA fragments under low MNase concentrations and cannot resist the higher concentrations of MNase typically used in MNase-seq. Since fragile nucleosomes are located within promoter regions and may play regulatory roles, further characterizing fragile nucleosomes is of critical importance. Currently available nucleosome analysis packages mainly focus on characterizing nucleosome dyad location, occupancy, and positioning fuzziness, and usually assume that nucleosomes protect a standard 147 bp DNA fragment. Thus, current approaches are not appropriate for detecting fragile nucleosomes.
To address the need for approaches that can characterize fragile nucleosomes, we developed the Size-based Expectation Maximization (SEM) nucleosome analysis package. SEM models the distribution of MNase-seq fragments around nucleosomes via a two-component Gaussian mixture model. In addition, SEM takes the distribution of protected DNA fragment lengths into consideration to distinguish nucleosome types. Benchmark analysis shows that SEM achieves performance competitive with existing nucleosome calling packages in predicting nucleosome dyad location, occupancy, and fuzziness.
Applied to a low MNase-concentration H2B MNase-ChIP-seq dataset from mouse embryonic stem cells, SEM discovers three nucleosome types: short-fragment nucleosomes, canonical nucleosomes, and di-nucleosomes. Depending on whether they are located within accessible regions, short-fragment nucleosomes can be further classified into two subtypes. One set of short-fragment nucleosomes is located within accessible regions, exhibits relatively high MNase sensitivity, and displays a distribution pattern around TSSs and CTCF peaks similar to that of the previously reported fragile nucleosomes. Another set of short-fragment nucleosomes (hereafter called “non-canonical” nucleosomes) is located outside accessible regions and displays a high enrichment of weak (A/T) nucleotides at the exit/entry sites. Directly related to this A/T enrichment, we observed a relatively high enrichment of several transcription factor binding motifs, such as those of Fox family factors. Although A/T-biased MNase digestion may cause the sensitivity of these non-canonical nucleosomes, we suggest that the motifs at the entry/exit sites of non-canonical nucleosomes could potentially serve as transcription factor engaging sites.
In summary, SEM provides an effective platform for characterizing distinct nucleosome subtypes and will facilitate a deeper characterization of fragile nucleosomes.
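For readers unfamiliar with the technique, here is a minimal sketch of expectation maximization for a generic two-component, one-dimensional Gaussian mixture, applied to toy values resembling fragment lengths. It is not the SEM package's model or code: the function names, the initial parameters, the standard-deviation floor, and the data are assumptions for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_two_gaussians(xs, mu_init, sigma_init=10.0, n_iter=50):
    """Fit a two-component 1D Gaussian mixture with plain EM."""
    mus = list(mu_init)
    sigmas = [sigma_init, sigma_init]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(p) or 1e-300  # guard against numerical underflow
            resp.append([pk / total for pk in p])
        # M-step: update weight, mean, and std of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / len(xs)
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigmas[k] = max(math.sqrt(var), 1.0)  # floor keeps sigma positive
    return weights, mus, sigmas

# Toy fragment lengths in bp (hypothetical): one short and one canonical group.
lengths = [95, 100, 105, 98, 102, 142, 147, 152, 145, 150]
weights, mus, sigmas = fit_two_gaussians(lengths, mu_init=(100.0, 150.0))
```

On well-separated data like this, the two fitted means settle close to the averages of the two groups, and the mixture weights reflect the share of observations assigned to each component.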

34: A practical comparison of the next-generation sequencing platform, depth, and assembly software using yeast genome
COSI: rsg
  • Min-Seung Jeon, Department of Life Science, Chung-Ang University, South Korea
  • Da Min Jeong, Department of Life Science, Chung-Ang University, South Korea
  • Huijeong Doh, Department of Life Science, Chung-Ang University, South Korea
  • Hyun Ah Kang, Department of Life Science, Chung-Ang University, South Korea
  • Hyungtaek Jung, Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Australia
  • Seong-il Eyun, Department of Life Science, Chung-Ang University, South Korea


Presentation Overview: Show

Assembling fragmented whole-genome information from sequencing data is a necessary step for further genome-wide research. However, it is difficult to select the appropriate assembly pipeline for an unknown species, since the optimal whole-genome assembly strategy can differ according to species-specific genomic properties. Therefore, this study focused on the synergy between the characteristics of sequencing platforms and assembly algorithms, which have relatively more static proclivities than the fickle genome sequences. A total of 208 draft and polished de novo assemblies were constructed under different sequencing platforms and assembly algorithms using a repetitive yeast genome. Our comprehensive data indicated that sequencing reads from Oxford Nanopore with the R7.3 flow cell generated more continuous assemblies than those derived from PacBio Sequel. However, nanopore-based assemblies also showed homopolymer-based assembly errors and chimeric contigs in assembly results, except for Canu and SPAdes. Additionally, the comparison between two second-generation sequencing (SGS) platforms showed that Illumina NovaSeq 6000 provides more accurate and continuous assemblies in the SGS-first pipeline, but MGI DNBSEQ-T7 provides cheap and accurate reads for the polishing process. Furthermore, our insight into the relationship among computational time, read length, and coverage depth provided clues to the optimal pipelines for different assembly purposes.

35: Mapping genotype to phenotype through joint probabilistic modeling of single-cell gene expression and chromosomal copy number variation
COSI: rsg
  • Linyue Fan, Columbia University, United States
  • Isha Arora, Cornell University, United States
  • Alexander Preau, Columbia University, United States
  • Yiping Wang, Columbia University, United States
  • Johannes Melms, Columbia University, United States
  • Amit Dipak Amin, Columbia University, United States
  • Yohanna Georgis, Columbia University, United States
  • Patricia Ho, Columbia University, United States
  • Lindsay Caprio, Columbia University, United States
  • Antoni Ribas, UCLA, United States
  • Alison Taylor, Columbia University, United States
  • Benjamin Izar, Columbia University, United States
  • Elham Azizi, Columbia University, United States


Presentation Overview: Show

Large-scale copy number variation (CNV) is a major driver of tumor heterogeneity, the phenomenon by which cellular clones within a tumor may exhibit differing disease states and responses to therapy. CNVs are traditionally profiled at the bulk level through whole genome sequencing (WGS), and clonal structure is inferred from this bulk data. Understanding these genomic changes along with their respective transcriptional output at single-cell resolution has the potential to inform our understanding of both tumorigenesis and the genomic and phenotypic determinants of drug response/resistance. Here, we propose ECHIDNA, a novel probabilistic model that leverages single-cell RNA sequencing (scRNA-seq) data to deconvolve concurrently collected, population-matched bulk WGS data into constituent clones defined by a Dirichlet process. Jointly, we factorize scRNA-seq data into cell and gene latent factors [1] driven by the inferred CNVs, which may be examined for biological insight. These latent factors are learned with an additional temporal dimension, corresponding to samples collected over the course of treatment, while the inferred CNVs are uniform across all time points. This temporal axis allows for the connection of a constant clonal genotype to temporally dynamic changes in cell phenotype, while the integration of both data modalities increases the robustness of both deconvolution and definition of clones. We apply the model to biopsies from melanoma patients undergoing anti-PD1 immunotherapy, and demonstrate accurate deconvolution as well as the ability to study the impact of therapy on diverse tumor clones.

[1] Levitin, Hanna Mendes, Yuan, Jinzhou, Cheng, Yim Ling. De novo gene signature identification from single-cell RNA-seq with hierarchical Poisson factorization. Mol Syst Biol (2019)

36: Dream challenge 2022: Prediction of regulatory activity with GC-correction
COSI: dream
  • Pyaree Mohan Dash, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Germany
  • Sebastian Röner, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Germany
  • Martin Kircher, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Germany
  • Max Schubach, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Germany


Presentation Overview: Show

Understanding gene regulation is a crucial step towards the interpretation of sequence alterations causing disease. Despite the rise of high-throughput functional assays that can measure the regulatory activity of genomic elements, it is still impossible to test the large universe of all potential variants. Therefore, predictive models are built to learn and estimate the regulatory potential of DNA sequences. Here, we present a simple convolutional neural network that was trained on promoter sequences to predict expression data of a high-throughput assay from the DREAM Challenge 2022 (syn28469146). We used a classification approach to predict the 18 different expression classes of the assay. Because experimental data may be biased towards GC content (0.34 Pearson correlation on this data), we implemented a GC correction step for the training data. This way, the model can focus on motifs within the sequence rather than general nucleotide composition. Finally, our model was able to predict the potential activity of short and random promoter sequences with a test Pearson correlation of 0.93. Our unique GC-aware training strategy makes the model more interpretable with respect to learned motifs and enables the prediction of sequence alteration effects, a step forward towards understanding disease.
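The abstract does not say how the GC correction was done, so the sketch below only shows one way to measure a GC bias (Pearson correlation between GC fraction and expression) and one possible correction (removing the linear GC trend). The residualization step, the function names, and the toy sequences and expression values are assumptions, not the authors' method.

```python
import math

def gc_fraction(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residualize(xs, ys):
    """Remove the least-squares linear trend of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - slope * (x - mx) for x, y in zip(xs, ys)]

# Toy sequences and expression values (hypothetical).
seqs = ["ATATATAT", "ATGCATAT", "GCGCATAT", "GCGCGCAT", "GCGCGCGC"]
expression = [2.0, 5.0, 4.0, 9.0, 8.0]

gc = [gc_fraction(s) for s in seqs]
r_raw = pearson(gc, expression)            # GC bias before correction
corrected = residualize(gc, expression)
r_corrected = pearson(gc, corrected)       # close to zero after correction
```

Other correction strategies, such as resampling the training data so that GC content is balanced across expression classes, would fit a classification setup equally well.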

37: Barriers to detecting intracellular transcriptional regulatory signals in co-expression
COSI: rsg
  • Eric Ching-Pan Chu, University of British Columbia, Canada
  • Alexander Morin, University of British Columbia, Canada
  • Marine Louarn, University of British Columbia, Canada
  • Paul Pavlidis, University of British Columbia, Canada


Presentation Overview: Show

Interpretation of RNA co-expression as reflecting regulatory networks is commonplace in genomics research. For instance, clusters of co-expressed genes are frequently used to infer transcription factors that regulate them at the cellular level. Despite the application of this approach in thousands of publications, the extent to which intracellular transcriptional regulatory patterns are preserved and detectable in co-expression is unclear, especially in bulk tissue data. For example, cell type compositional differences may be a significant driver of co-expression patterns in bulk tissue data, and dilution of cell type specific signals may also have considerable impact. In this study, we used a novel computational simulator and analysis of publicly available data sets to quantify the effects of these and other factors on the propagation of cellular co-expression signals to the bulk tissue level. We show that aggregation of signals from multiple cell types in bulk tissue samples can drastically dilute the strength of most cell type specific co-expression patterns. As expected, we found that distortion of cell-level co-expression patterns by cellular compositional effects adds to the challenge. One notable exception to the disconnect between cell-level and tissue-level co-expression is provided by ribosomal protein genes. Co-expression of these genes, which are known to be co-regulated, is one of the most robust findings in the field. We show that this is primarily due to apparent synchrony of inter-subject differences in their genes’ expression patterns among different cell types as well as cell-type differences in expression, leading to striking patterns in both single-cell and bulk tissue data. However, our analysis shows the ribosome to be an exceptional outlier.
Our results provide a quantitative explanation for why regulatory network inference from co-expression has proved challenging - even with the assistance of other data modalities - and give the scientific community a set of tools to further explore these issues in both single-cell and bulk tissue data.
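The dilution effect described above can be illustrated with a toy calculation: two genes are perfectly co-expressed across subjects in one cell type, but that cell type makes up only a small share of the bulk sample. This is not the authors' simulator; the function names, proportions, and expression values are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def bulk_profile(cell_type_profiles, proportions):
    """Mix per-subject expression of several cell types at fixed proportions."""
    n_subjects = len(cell_type_profiles[0])
    return [
        sum(p * profile[i] for p, profile in zip(proportions, cell_type_profiles))
        for i in range(n_subjects)
    ]

# Per-subject expression of genes X and Y (hypothetical values).
gene_x_a = [1, 2, 3, 4, 5, 6]        # cell type A: X and Y co-expressed
gene_y_a = [1, 2, 3, 4, 5, 6]
gene_x_b = [30, 5, 25, 10, 20, 15]   # cell type B: unrelated pattern
gene_y_b = [10, 25, 5, 30, 15, 20]

proportions = (0.2, 0.8)  # cell type A is a minority of the tissue

r_cell = pearson(gene_x_a, gene_y_a)
r_bulk = pearson(
    bulk_profile([gene_x_a, gene_x_b], proportions),
    bulk_profile([gene_y_a, gene_y_b], proportions),
)
```

Here the cell-level correlation in type A is perfect, while the bulk correlation is dominated by the majority cell type. Letting the proportions vary between subjects would add the compositional effect that the abstract also mentions.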